BMC Medical Informatics and Decision Making — Latest Matching Preprints

1

Deep Learning-Based Missing Value Imputation for Heart Failure Data from MIMIC-III: A Comparative Study of DAE, SAITS, and MICE+LightGBM

Sharma, S.; Kaur, M.; Gupta, S.

2026-02-11 health systems and quality improvement 10.64898/2026.02.10.26345979 medRxiv

Top 0.1%

33.3%

Background: Electronic Health Records (EHR) are crucial for Clinical Decision Support Systems and for delivering proper care to ICU heart failure patients, but data are often missing because of monitoring device errors, which creates a need for robust imputation methodologies. Objective: To compare and evaluate three methodologies for imputing missing data for heart failure patients from the MIMIC-III database: Denoising Autoencoder (DAE), Self-Attention Imputation for Time Series (SAITS), and Multiple Imputation by Chained Equations (MICE) with LightGBM. Methods: Data from 14,090 ICU admissions for patients with heart failure in the MIMIC-III database were analyzed. Features were chosen for clinical relevance, and 19 clinical features were selected through a combination of Random Forest analysis, correlation analysis, and Mutual Information. Artificial missing values were introduced into the data set at rates of 20%, 30%, and 50%, and the three imputation methodologies (DAE, SAITS, and MICE+LightGBM) were then evaluated. Performance was measured using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Normalized Root Mean Square Error (NRMSE). Results: Both DAE and SAITS had superior imputation performance across all missingness rates. At 20% missingness, DAE had mean MAE = 0.004967, RMSE = 0.005217, and NRMSE = 3.260893, while SAITS had mean MAE = 0.005461, RMSE = 0.005797, and NRMSE = 3.244695; MICE+LightGBM produced higher errors. At 50% missingness, SAITS performed best, followed by DAE, while MICE+LightGBM showed decreased performance. The deep learning methodologies maintained a consistent level of accuracy across the clinical variables measured. Conclusions: Our analysis indicates that deep learning-based imputation methodologies significantly outperform traditional methodologies for imputing missing values in ICU heart failure data, supporting the implementation of these methodologies in Clinical Decision Support Systems for heart failure patients.
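The evaluation protocol in this abstract (hide known values, impute them, score the imputations) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the toy column, the mean-imputation baseline, and the helper names are hypothetical, and NRMSE is normalised by the range of the hidden true values here because the abstract does not say which normalisation was used.

```python
import math
import random


def mae(true_vals, imputed_vals):
    """Mean Absolute Error over the artificially hidden entries."""
    return sum(abs(t - p) for t, p in zip(true_vals, imputed_vals)) / len(true_vals)


def rmse(true_vals, imputed_vals):
    """Root Mean Square Error over the artificially hidden entries."""
    squared = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals))
    return math.sqrt(squared / len(true_vals))


def nrmse(true_vals, imputed_vals):
    """RMSE divided by the range of the true values (an assumed normalisation)."""
    value_range = max(true_vals) - min(true_vals)
    return rmse(true_vals, imputed_vals) / value_range


def mask_indices(n, rate, seed=0):
    """Pick which positions to hide, e.g. rate=0.2 for 20% missingness."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), int(round(n * rate))))


# Hypothetical, already-scaled clinical feature column.
column = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
hidden = mask_indices(len(column), rate=0.2)

# Baseline imputer: fill every hidden entry with the mean of the observed ones.
observed = [v for i, v in enumerate(column) if i not in hidden]
fill_value = sum(observed) / len(observed)

truth = [column[i] for i in hidden]
imputed = [fill_value] * len(hidden)

print("MAE:", mae(truth, imputed))
print("RMSE:", rmse(truth, imputed))
print("NRMSE:", nrmse(truth, imputed))
```

Any imputer (DAE, SAITS, MICE+LightGBM) could replace the mean baseline, provided it is scored only on the hidden positions.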

2

Towards clinical implementation of artificial intelligence in cancer care: Concept mapping analysis of provincial workshop findings

Nayyar, C.; Xu, H. H.; Bates, A. T.; Conati, C.; Hilbers, D.; Avery, J.; Raman, S.; Fayaz-Bakhsh, A.; Nunez, J.-J.

2026-03-27 health systems and quality improvement 10.64898/2026.03.26.26349205 medRxiv

Top 0.1%

28.5%

Background: Artificial intelligence (AI) has rapidly garnered interest in healthcare, with research showing promise to improve quality, efficiency, and outcomes. Cancer care's multidisciplinary nature and high coordination demands are well positioned to benefit from AI. While attitudes in the uptake of evidence and toward the implementation of AI in medicine have been explored generally, literature remains scarce specifically regarding AI in cancer care. This study sought to understand the perspectives of both patients and professionals, which are essential for guiding responsible, effective implementation of evidence-based (EB) AI in cancer care. Methods: We conducted a workshop at the 2024 British Columbia (BC) Cancer Summit (Vancouver, Canada). Discussions addressed three guiding questions: concerns, benefits, and priorities for AI in cancer care. Responses from 48 workshop participants (patients and families, AI/computer science/cancer researchers, clinicians and allied health professionals, information technology professionals, healthcare administrators) underwent structured conceptualization by concept mapping, leveraging multidimensional scaling and hierarchical cluster and subcluster analysis to produce visual and quantitative maps of stakeholder priorities. Results: A total of 265 statements on perceived benefits, concerns, and priorities related to the implementation of AI in cancer care were generated from the workshop and underwent concept mapping. Two clusters were identified: Cluster 1 focused on "Challenges and Safeguards for AI Implementation," and Cluster 2 focused on "Clinical Benefits and Efficiency Gains." Subcluster analysis distinguished 8 thematic subclusters (4 per cluster). Both mean importance (P < .001) and feasibility (P < .001) ratings were significantly higher for Cluster 2. No differences were found between ratings by clinical and nonclinical professionals. Further go-zone analysis classified statements according to their relative superiority/inferiority in importance and feasibility compared to the overall average. Conclusions: Stakeholder ratings were higher for statements describing clinical benefits and efficiency gains than for those describing challenges and safeguards for AI implementation in cancer care. Concept mapping analysis distinguished between workflow-aligned AI applications, perceived as ready for implementation, and system-level governance requirements requiring longer-term investment. Present findings provide a structured, stakeholder-informed framework for prioritizing and sequencing AI implementation efforts in cancer care, constituting a practical blueprint to catalyze meaningful progress.

3

Evaluating Redundancy and Biases in EHR Social Determinants of Health Data Screening

Powers, J. P.; Shaheen, A.; Entwisle, B.; Pfaff, E.

2026-02-19 health systems and quality improvement 10.64898/2026.02.18.26346575 medRxiv

Top 0.1%

25.9%

Introduction: Healthcare organizations have begun incorporating screening procedures for social determinants of health (SDOH) into care, recognizing the impact these factors can have on health outcomes. We aimed to present methods for evaluating redundancy in the risk information gained across SDOH questions and for evaluating whether demographic biases are present in whether patients were asked SDOH questions and whether they declined to answer them. Methods: SDOH question data were analyzed for 1.8 million UNC Health patients. To evaluate risk information redundancy, response agreement was analyzed for pairs of questions. Demographic biases were evaluated using logistic regression models. Results: Risk information redundancy was identified, particularly across food and financial insecurity questions. Furthermore, female and White patients were more likely to be asked some questions than other groups, and American Indian or Alaska Native and Hispanic or Latino patients were less likely to decline to answer questions. Conclusions: We demonstrated methods healthcare organizations can use to evaluate their SDOH screening procedures. These methods yielded insights for (1) reducing burden in clinical workflows by identifying where redundancy could be eliminated and (2) reducing bias in SDOH data collection through more systematic screening protocols.

4

Improving Medicare Fraud Detection Accuracy in Deep Learning by Exploring Feature Selection and Data Sampling Techniques.

Ahammed, F.

2026-03-20 health informatics 10.64898/2026.03.18.26348763 medRxiv

Top 0.1%

23.3%

Fraud in healthcare is a worsening problem, with far-reaching consequences that burden the financial stability of the health industry and threaten the quality of medical care. It results from vulnerabilities within the current healthcare framework that fraudsters exploit. Although many models have been developed to detect fraudulent patterns in insurance claims, their accuracy frequently suffers from class imbalance in the Medicare dataset and from irrelevant features. This study aims to improve detection performance and accuracy by employing a deep learning model along with data sampling and feature selection techniques. A comparative analysis of different combinations is conducted to determine how effectively each improves the accuracy of the fraud detection model. The results indicate that combining data sampling and feature selection techniques improves accuracy and performance. Using both Chi-square feature selection and the Synthetic Minority Over-sampling Technique (SMOTE), the model reached an accuracy of 95.4% with negligible evidence of overfitting. Ultimately, the study findings underscore the significance of employing combined techniques instead of using only the baseline deep learning model for better performance in detecting Medicare insurance fraud.

5

Augmenting Electronic Health Records for Adverse Event Detection

Kaynar, G.; You, Z.; Boyce, R. D.; Yakoh, T.; Kingsford, C.

2026-02-11 health informatics 10.64898/2026.02.10.26345962 medRxiv

Top 0.1%

22.7%

Objective: Adverse events (AEs) resulting from medical interventions are significant contributors to patient morbidity, mortality, and healthcare costs. Prediction of these events using electronic health records (EHRs) can facilitate timely clinical interventions. However, effective prediction remains challenging due to severe class imbalance, missing labels, and the complexity of EHR records. Classical machine learning approaches frequently underperform due to insufficient representation of minority adverse event classes and limited capacity to capture interactions among patient demographics, administered medications, and associated complications. Methods: We introduce TASER-AE, a novel data augmentation pipeline tailored for structured EHR data, coupled with transformer-based classification. TASER-AE addresses these issues through an NLP-inspired data augmentation framework adapted for EHR, enabling effective minority-class representation in sparse and imbalanced clinical datasets. The augmented records produced by TASER-AE alleviate class imbalance by enriching the representation of minority adverse event classes, which enhances the robustness and predictive performance of the classifier. Results: TASER-AE yields minority-class F1 scores up to 0.70, substantially surpassing classical machine-learning baselines and prior augmentation methods across multiple adverse event tasks. Experiments conducted on two distinct EHR datasets confirm TASER-AE's ability to substantially improve adverse event detection performance. Conclusion: These results demonstrate the potential of structured, NLP-inspired augmentation methods to overcome data limitations in clinical predictive modeling, ultimately contributing to improved patient safety outcomes. TASER-AE is available at https://github.com/Kingsford-Group/taserae.

6

Class imbalance correction in artificial intelligence models leads to miscalibrated clinical predictions: a real-world evaluation

Roesler, M. W.; Wells, C.; Schamberg, G.; Gao, J.; Harrison, E.; O'Grady, G.; Varghese, C.

2026-03-05 health informatics 10.64898/2026.03.04.26347634 medRxiv

Top 0.1%

18.1%

Background: Predictive models employing machine learning algorithms are increasingly being used in clinical decision making, and improperly calibrated models can result in systematic harm. We sought to investigate the impact of class imbalance correction, a commonly applied preprocessing step in machine learning model development, on calibration and modelled clinical decision making in a large real-world context. Methods: A histogram-based gradient boosting classifier was trained on a highly imbalanced national dataset of >1.8 million patients undergoing surgery, to predict the risk of 90-day mortality and complications after surgery. Class imbalance correction strategies including random oversampling (ROS), synthetic minority oversampling technique (SMOTE), random under-sampling (RUS), and cost-sensitive learning (CSL) were compared to the natural distribution (natural). Models were tested and compared with classification metrics, calibration plots, decision curve analysis, and simulated clinical impact analysis. Results: The natural model demonstrated high performance (AUROC 0.94, 95% CI 0.94-0.95 for mortality; 0.84, 95% CI 0.84-0.85 for complications) and calibration (log loss 0.05, 95% CI 0.04-0.05 for mortality; 0.23, 95% CI 0.23-0.24 for complications). Class imbalance mitigation (CSL, ROS, RUS, and SMOTE) did not improve AUROC or AUPRC but increased recall and F1 scores at the expense of precision and accuracy. However, these methods severely compromised model calibration, leading to significant over-prediction of risks (up to a 62.8% increase) as further evidenced by increased log loss across all mitigation techniques. Decision curve analysis and clinical scenario testing confirmed that the natural model provided the highest net benefit. Conclusion: Class imbalance correction methods result in significant miscalibration, leading to possible harm when used for clinical decision making.
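One way to see why resampling inflates predicted risk is the standard prior-shift identity: if resampling changes only the class prevalence, the odds a well-specified model learns are the natural odds multiplied by the ratio of resampled to natural prevalence odds. The sketch below illustrates that identity with hypothetical numbers. It is not the authors' analysis, and the function name `shift_probability` is an assumption made for this example.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def shift_probability(p, from_prevalence, to_prevalence):
    """Re-express probability p after the class prevalence changes.

    Assumes the only difference between the two settings is the outcome
    prevalence (for example, after random over- or under-sampling).
    """
    shifted_odds = odds(p) * odds(to_prevalence) / odds(from_prevalence)
    return shifted_odds / (1.0 + shifted_odds)


# Hypothetical numbers: a 2% natural event rate, resampled to 50/50.
natural_prevalence = 0.02
balanced_prevalence = 0.50

true_risk = 0.05  # a patient whose calibrated risk is 5%
inflated = shift_probability(true_risk, natural_prevalence, balanced_prevalence)
print(f"risk reported by the balanced-data model: {inflated:.3f}")

# Applying the identity in reverse recovers the calibrated risk.
recovered = shift_probability(inflated, balanced_prevalence, natural_prevalence)
print(f"risk after correcting back to the natural prevalence: {recovered:.3f}")
```

Under these assumptions a calibrated 5% risk is reported as roughly 72%, which is the kind of over-prediction the abstract describes. Ranking metrics such as AUROC are unaffected because the transformation is monotonic.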

7

Predicting Rectal Cancer Patient Survival with Dutch Radiology Reports using Natural Language Processing (NLP): The Role of Pretrained Language Models

Cai, L.; Zhang, T.; Beets-Tan, R.; Brunekreef, J.; Teuwen, J.

2026-01-30 health informatics 10.64898/2026.01.23.26344428 medRxiv

Top 0.1%

18.1%

The use of Electronic Health Records (EHRs) has increased significantly in recent years. However, a substantial portion of the clinical data remains in unstructured text formats, especially in the context of radiology. This limits the application of EHRs for automated analysis in oncology research. Pretrained language models have been utilized to extract feature embeddings from these reports for downstream clinical applications, such as treatment response and survival prediction. However, a thorough investigation into which pretrained models produce the most effective features for rectal cancer survival prediction has not yet been done. This study explores the performance of five Dutch pretrained language models, including two publicly available models (RobBERT and MedRoBERTa.nl) and three developed in-house for the purpose of this study (RecRoBERT, BRecRoBERT, and BRec2RoBERT) with training on distinct Dutch-only corpora, in predicting overall survival and disease-free survival outcomes in rectal cancer patients. Our results showed that our in-house developed BRecRoBERT, a RoBERTa-based language model trained from scratch on a combination of Dutch breast and rectal cancer corpora, delivered the best predictive performance for both survival tasks, achieving a C-index of 0.65 (0.57, 0.73) for overall survival and 0.71 (0.64, 0.78) for disease-free survival. It outperformed models trained on general Dutch corpora (RobBERT) or Dutch hospital clinical notes (MedRoBERTa.nl). BRecRoBERT demonstrated the potential capability to predict survival in rectal cancer patients using Dutch radiology reports at diagnosis. This study highlights the value of pretrained language models that incorporate domain-specific knowledge for downstream clinical applications. Furthermore, it proves that utilizing data from related domains can improve the quality of feature embeddings for certain clinical tasks, particularly in situations where domain-specific data is scarce.

8

Thyroid Cancer Risk Prediction from Multimodal Datasets Using Large Language Model

Ray, P.

2026-03-06 health informatics 10.64898/2026.03.05.26347766 medRxiv

Top 0.1%

14.4%

Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Diagnostic methods that medical practitioners use at present depend on their personal judgment to evaluate both imaging results and separate clinical tests, which creates inconsistency that leads to incorrect medical evaluations. The combination of radiological imaging with clinical information systems enables healthcare providers to enhance their capacity to make reliable predictions about patient outcomes while improving their decision-making abilities. The study introduces a deep learning framework that utilizes multiple data sources by combining magnetic resonance imaging (MRI) data with clinical text to predict thyroid cancer. The system uses a Vision Transformer (ViT) to obtain advanced MRI scan features, while a domain-adapted language model processes clinical documents that contain patient medical history and symptoms and laboratory results. The cross-modal attention system enables the system to merge imaging data with textual information from different sources, which helps to identify how the two types of data are interconnected. The system uses a classification layer to classify the fused features, which allows it to determine the probability of cancerous tumors. The experimental results show that the proposed multimodal system achieves better results than the unimodal base systems because it has higher accuracy, sensitivity, specificity, and AUC values, which help medical personnel to make better preoperative decisions.

9

A bibliometric review of explainable AI in diabetes risk prediction: Trends, gaps, and knowledge graph opportunities

Van, T. A.

2026-04-20 health informatics 10.64898/2026.04.16.26351069 medRxiv

Top 0.1%

13.8%

Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components, ML, XAI, and KG, in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge. This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
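The headline ratios in this abstract follow directly from the reported counts. A short arithmetic check, using only the figures quoted above (the variable names are illustrative):

```python
# Counts quoted in the abstract.
total_documents = 2048      # deduplicated corpus size
kg_gnn_documents = 17       # documents mentioning KG/GNN terms
xai_documents = 906         # documents mentioning XAI methods
documents_2020 = 36
documents_2025 = 866

# Share of the corpus that mentions KG/GNN terms, as a percentage.
kg_share_percent = 100 * kg_gnn_documents / total_documents

# How many XAI-method documents exist per KG/GNN document.
disparity = xai_documents / kg_gnn_documents

# Overall growth in annual output between 2020 and 2025.
growth_factor = documents_2025 / documents_2020

print(f"KG/GNN share: {kg_share_percent:.2f}%")
print(f"XAI to KG/GNN disparity: {disparity:.1f}:1")
print(f"Growth factor 2020 to 2025: {growth_factor:.1f}x")
```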

10

Natural Language Processing Analysis of Australian Health Practitioner Disciplinary Tribunal Decisions, 1999-2026

Farquhar, H. L.

2026-02-17 health policy 10.64898/2026.02.13.26346299 medRxiv

Top 0.2%

12.4%

Natural language processing was applied to 3,586 Australian health practitioner tribunal decisions (1999-2026) to identify patterns in professional misconduct, outcomes, and temporal trends at a scale impractical through manual analysis. A text classification approach categorised 2,428 disciplinary decisions across seven misconduct types with acceptable accuracy for the major categories (per-class F1 0.47-0.82). Boundary violations were the most prevalent misconduct type (30.2%), followed by dishonesty/fraud (29.7%) and professional conduct breaches (28.0%). Reprimand was the most common outcome (53.0%), followed by cancellation (40.2%). Significant increasing trends were identified for boundary violations, dishonesty/fraud, professional conduct breaches, and communication failures. Boundary violations were associated with higher cancellation odds (OR = 1.36, p < 0.001). Opioid medications appeared in 67% of prescribing misconduct decisions. Significant jurisdictional variation in both misconduct types and outcomes was observed, with large effect sizes between major jurisdictions. The findings provide an empirical foundation for monitoring disciplinary trends under the National Law.

11

Improving mortality prediction in critically ill cancer patients with a multidimensional machine learning model

Nieto Estrada, V. H.; Aya Porto, A. C.; Cardona Zorrilla, A. F.; Pulido Ramirez, E. O.; Trujillo Gordillo, H.; Sanchez Pineros, N. G.; Wagner Gutierrez, N.; Arrieta, O.; Molano, D. F.; Rolfo, C.; Nigita, G.; Nates, J.

2026-02-03 intensive care and critical care medicine 10.64898/2026.02.02.26345349 medRxiv

Top 0.2%

12.2%

Background: Prognostic assessment in critically ill patients with cancer remains challenging, as conventional ICU severity scores often perform suboptimally in this population. Machine learning (ML) approaches may improve outcome prediction by integrating acute physiology, organ dysfunction, and oncologic variables. We aimed to develop and validate ML-based models to predict ICU mortality and 30-day survival in critically ill cancer patients. Methods: We conducted a retrospective cohort study including 997 critically ill cancer patients admitted to the ICU. Forty-eight demographic, oncologic, physiological, laboratory, and therapeutic variables collected at ICU admission were used to train and validate ML models. Eight algorithms were evaluated using stratified cross-validation with feature selection and hyperparameter optimization. Model performance was assessed using discrimination, calibration, and classification metrics. Model interpretability was explored using Shapley additive explanations (SHAP). Results: CatBoost achieved the best performance for ICU mortality prediction (AUROC 0.96), showing excellent discrimination and calibration, and outperforming other ML models. Prediction of 30-day survival was less accurate (best AUROC 0.75), reflecting the influence of post-ICU factors not captured at admission. Key predictors of ICU mortality included severity of organ dysfunction, therapeutic objectives, vasopressor and methylene blue use, SAPS III score, lactate, platelet count, and blood urea nitrogen. For 30-day survival, baseline physiological status, admission type, SAPS III, lactate, creatinine, age, and body mass index were most relevant. SHAP analysis demonstrated that acute physiology and organ dysfunction, rather than cancer diagnosis alone, primarily drove short-term outcomes. Conclusions: ML-based models, particularly CatBoost, outperformed traditional prognostic tools for predicting ICU mortality in critically ill cancer patients. Cancer was not an independent predictor of short-term mortality; outcomes were primarily driven by pre-ICU conditions, acute physiology, and severity of organ dysfunction. External validation is needed to confirm generalizability and support future integration of ML-based prediction tools into clinical decision-making in oncologic critical care.

12

Extending the OMOP Common Data Model to Support Observational Peripheral Vascular Disease Research

Leese, P. J.; McIntee, T.; Browder, S. E.; Laivuori, M.; Alabi, O.; McGinigle, K. L.

2026-02-03 health informatics 10.64898/2026.02.01.26345276 medRxiv

Top 0.2%

12.2%

Background: Peripheral artery disease (PAD) and chronic limb-threatening ischemia (CLTI) cause substantial morbidity and mortality, yet research progress is limited by fragmented, non-standardized data. The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) provides a standardized framework for electronic health record (EHR) research but lacks domain-specific detail for peripheral vascular diseases. This study aimed to develop and test a vascular-specific OMOP CDM extension to improve data standardization, enable reproducible real-world analyses, and support precision medicine research in PAD and CLTI. Methods: We identified patients with PAD, CLTI, or diabetic foot ulcers who sought care within the UNC Health System between April 2014 and July 2024. Standard OMOP tables were supplemented with peripheral vascular laboratory (PVL) data and state death records. Intermediate tables were designed for key clinical domains (e.g., smoking, comorbidities, revascularizations) to enhance reusability. Predictive models for revascularization and mortality were developed using logistic regression with Bayesian weighting and Markov Chain Monte Carlo feature selection. Clinical Application: The revascularization model displayed high performance with and without important vascular variables (AUC = 0.970 and AUC = 0.969, respectively), while the mortality model demonstrated moderate accuracy (AUC = 0.656) that improved with inclusion of vascular-specific features (AUC = 0.752). Conclusions: This vascular OMOP extension represents one of the first specialty-specific frameworks for peripheral vascular research. By extending the OMOP CDM to a vascular domain, this work advances both the technical framework and scientific capability of real-world data research in limb preservation and precision vascular medicine.

13

ChatGPT with Mixed-Integer Linear Programming for Precision Nutrition Recommendations

Alkeyeva, R.; Nagiyev, I.; Kim, D.; Nurmanova, B.; Omarova, Z.; Varol, H. A.; Chan, M.-Y.

2026-02-17 health informatics 10.64898/2026.02.14.26346312 medRxiv

Top 0.2%

12.2%

Background: The growing interest in applying artificial intelligence in personalized nutrition is challenged by the complex nature of dietary advice that must balance health, economic, and personal factors. Though automated solutions using either Linear Programming (LP) or Large Language Models (LLMs) already exist, they have significant drawbacks. LP often lacks personalization, whereas LLMs can be unreliable for precise calculations. Objectives: To develop and assess a model that integrates a Mixed Integer Linear Programming (MILP) solver with an LLM to generate personalized meal plans and compare it with standalone LLM and MILP models. Methods: The proposed hybrid MILP+LLM model first uses an LLM (GPT-4o) to filter a unified food dataset (n=297), which combines regional Central Asian and global food items, according to the user's profile. The filtered list of food items is then passed to a MILP solver, which identifies the set of top 10 optimal solutions. Finally, given this set of solutions, the LLM chooses the most appropriate meal plan. The model was evaluated using five synthesized, clinically complex patient profiles sourced from Adilmetova et al. [4]. The performance of this hybrid model was compared against standalone MILP and LLM using a 5-point Likert scale with Kruskal-Wallis and post hoc Dunn's tests for Nutrient Accuracy, Personalization, Practicality, and Variety. Results: Findings demonstrated that the proposed MILP+LLM model reached balanced performance, achieving scores of more than 3.6 points in all criteria, with high scores in Nutrient Accuracy (3.96), Personalization (3.81), and Practicality (3.99). The standalone LLM model performed the weakest in all criteria, with statistically significant lower scores compared to the other two methods. The standalone MILP model performed best in Nutrient Accuracy (4.93) and in Variety (4.10) but lagged behind the MILP+LLM model in Practicality and Personalization. Kruskal-Wallis and Dunn's tests showed MILP and MILP+LLM outperformed LLM across all criteria. MILP was more accurate (p<0.0001), while the MILP+LLM model was more practical (p=0.021). Conclusions: The findings suggest that integrating the LLM with the MILP solver creates a model that combines qualitative personalization with quantitative precision. This model produces comprehensive, reliable meal plans, addressing the limitations of using either model alone.

14

Runtime Anomaly Detection and Assurance Framework for AI-Driven Nurse Call Systems

Liu, Y.; Concepcion, D.

2026-03-18 health informatics 10.64898/2026.03.16.26348563 medRxiv

Top 0.2%

12.1%

This research proposes an anomaly detection and assurance framework for AI-driven Nurse Call Systems (NCS) during operation. The study detects abnormal behaviors by simulating real call logs, injecting controllable anomalies, and applying a lightweight Isolation Forest model. The final visualization results are presented through an interactive dashboard. Our research focuses mainly on the medical environment, which is delay-sensitive and safety-critical. A distinctive feature of this research is that it can effectively enhance the reliability of system operation without relying on complex deep models or proprietary data, while maintaining safety and interpretability. The framework design emphasizes reproducibility while maintaining low computational overhead, so that the framework can be deployed rapidly on resource-constrained edge devices. Preliminary experimental results show that this method can maintain a reasonable precision rate. Additionally, when detecting delay-type anomalies, the results indicate a high recall rate. Moreover, to reflect the system's performance in real scenarios, the framework detects delay metrics and hourly alarm quantity metrics, and reports Precision-Recall curves and their confidence intervals. Future work will consider introducing time, context features, and explainability analysis modules. The aim is to improve the model's accuracy and further meet the medical industry's requirements for auditability. This work focuses on the operational safety and reliability of AI-enabled Nurse Call Systems, addressing runtime failure modes that are underrepresented in current healthcare AI deployments. Rather than proposing new learning models, the contribution lies in a reproducible, interpretable assurance framework suitable for real clinical infrastructure. To ensure transparency and reproducibility, all code, cleaned datasets, experiment scripts, and an interactive Streamlit demo (which allows users to upload their own CSVs) are publicly released as open research artifacts (Zenodo DOI: 10.5281/zenodo.17767143).

15

The Evolution and Equity of China's Pharmacist Workforce in Healthcare Institutions: A Provincial Panel Data Analysis, 2007-2023

Xia, Y.; Sun, L.; Zhao, Y.

2026-04-23 health policy 10.64898/2026.04.22.26351514 medRxiv

Top 0.2%

10.5%

Background: China has implemented policies to strengthen its pharmacist workforce since the 2009 healthcare reform, yet a comprehensive evaluation of their long-term systemic effects is lacking. Objective: To systematically analyze the evolution of Chinas pharmacist workforce in healthcare institutions from 2007 to 2023 across four dimensions: quantity, quality, structure, and distribution, providing an empirical foundation for policy optimization. Methods: A retrospective analysis was conducted using longitudinal data from the China Health Statistics Yearbooks. Trends were delineated via descriptive statistics. Equity and spatial evolution were assessed using the Gini coefficient, Theil index decomposition, and spatial autocorrelation analyses (Global Morans I and hotspot analysis). Results: From 2007 to 2023, the total number of pharmacists increased from 357,700 to 569,500 (average annual growth: 2.2%). This growth lagged behind physicians (4.6%) and nurses (7.4%), causing the pharmacist-to-physician ratio to decline from 1:5.15 to 1:8.39. The workforce showed trends of feminization (female proportion rose from 59.7% to 70.8%) and aging. While quality improved, 51.1% still held an associate degree or below, and only 6.6% held senior titles. Equity analysis revealed the provincial Gini coefficient improved from 0.145 to 0.093. Theil index decomposition confirmed intra-provincial disparities as the primary inequality driver. Spatial analysis showed a non-significant global Morans I by 2023 (0.154, P*>0.05), down from 0.254 (P<0.01) in 2007. Hotspot analysis confirmed this transition, revealing a contraction of high-confidence clusters and a trend toward balanced distribution. Conclusions: China has made measurable progress in expanding pharmacist workforce size and improving inter-provincial equity since 2007. 
However, persistent structural challenges remain: relative workforce contraction compared to other health professions, an aging demographic, a shortage of senior talent, and significant intra-provincial inequity. Future policies must prioritize optimizing workforce structure and enhancing clinical service capabilities to catalyze a shift toward patient-centered pharmaceutical care.
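
As an illustration of the equity measure named in the abstract above, the following is a minimal sketch of an unweighted Gini coefficient. The function name `gini` and the idea of feeding it per-province pharmacist densities are assumptions for illustration only; the abstract does not state whether the authors weighted provinces by population or which formula variant they used.

```python
def gini(values):
    """Unweighted Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted form over the sorted values x_1 <= ... <= x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical densities (pharmacists per 10,000 residents), not study data.
example_densities = [3.1, 3.6, 4.0, 4.2, 5.0]
print(round(gini(example_densities), 3))
```

Values close to 0 indicate an even distribution across units; the provincial figures reported in the abstract (0.145 falling to 0.093) both sit in that low range.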

16

Development of a natural language processing application to extract and categorize mentions of violence from mental healthcare records text

Li, L.; Sondh, S.; Sondh, H. K.; Stewart, R.; Roberts, A.

2026-03-26 health informatics 10.64898/2026.03.22.26348435 medRxiv

Top 0.2%

10.3%

Background: Experiences of violence are reported frequently by mental health service users; victims of violence are at a greater risk of mental health disorders, and violence may sometimes occur as a consequence of a mental disorder. Electronic health records (EHRs) are an important source of information about healthcare and its social context. Occurrences of violence are not routinely recorded as structured data in EHRs but are recorded in the free-text narrative. Objective: Our objective was to address this research gap by creating a natural language processing (NLP) application that extracts information related to various forms of violence (physical (non-sexual), sexual, emotional, and financial) from the EHR of a large south London mental health service. Additionally, we aimed to extract features concerning the patient's role (victimization vs. perpetration), timing (recent vs. historic), domestic context, presence (actual, threat, or unclear), and polarity (affirmed, abstract, or negated) of the violent behaviors. Methods: After rigorous training with a pre-defined and approved coding book provided by senior professionals, two raters independently annotated 6,500 randomly selected segments of clinical notes containing violence-related keywords from a large mental healthcare provider in South London; each segment contained 400 characters (approximately 200 characters before and after the keyword). We utilized 90% of the annotated data for fine-tuning a multi-label BERT model (employing 5-fold cross-validation) with the remaining 10% of data reserved for a blind test. Results: The model performed well on the blind test set for emotional violence (F1 = 0.89), financial violence (0.88), physical (non-sexual) violence (0.84), and unspecified violence (0.81), and the patient role (0.89 as perpetrator; 0.84 as victim), polarity (0.89 for affirmed behavior), presence (0.95 for actual violence), and domestic settings (0.88). 
We were unable to achieve satisfactory results in capturing temporal aspects (0.65 for past violence). Conclusions: We were able to improve substantially on previously developed NLP for ascertaining violence in routine mental health records, providing novel opportunities for both surveillance and research.
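
The data partition described in the Methods (10% blind test set, 5-fold cross-validation on the remaining 90%) can be sketched as below. This is an illustrative assumption, not the authors' procedure: the helper name `split_holdout_and_folds`, the plain random shuffle, and the seed are hypothetical, and the abstract does not say whether the split was stratified by label.

```python
import random


def split_holdout_and_folds(n_items, test_fraction=0.10, n_folds=5, seed=0):
    """Return (test indices, list of cross-validation folds) over range(n_items)."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    n_test = round(n_items * test_fraction)
    test_idx = indices[:n_test]           # held out, never used for fine-tuning
    dev_idx = indices[n_test:]
    # Deal the development indices round-robin into n_folds disjoint folds.
    folds = [dev_idx[k::n_folds] for k in range(n_folds)]
    return test_idx, folds


test_idx, folds = split_holdout_and_folds(6500)
for k, fold in enumerate(folds):
    # Fold k is the validation set; the other four folds form the training set.
    train = [i for j, other in enumerate(folds) if j != k for i in other]
    print(k, len(train), len(fold))
```

With 6,500 annotated segments this gives a blind test set of 650 segments and five validation folds of 1,170 segments each.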

17

Identification of Suicide-Related Subgroups Using Latent Class Analysis: Complementary Insights to Explainable AI-Based Classification

Kizilaslan, B.; Mehlum, L.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.25.26349264 medRxiv

Top 0.2%

10.1%

Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the findings of the XAI-based suicide classification model suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks. 
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning

18

The Risk Factors, Detection and Classification of Esophageal Cancer Using Ensemble Machine Learning Models

Gaso, M. S.; Mekuria, R. R.; Cankurt, S.; Deybasso, H. A.; Abdo, A. A.; Abbas, G. H.

2026-03-11 health informatics 10.64898/2026.03.09.26347944 medRxiv

Top 0.2%

10.0%

Esophageal cancer (EC) remains one of the most lethal malignancies worldwide, with poor survival outcomes largely attributable to late-stage diagnosis and limited treatment effectiveness. Early detection and accurate risk stratification are therefore essential for improving clinical management. In this study, we investigate the predictive value of socio-demographic, dietary, behavioral, environmental, and clinical variables collected from 312 individuals (104 EC cases and 208 controls) in the Arsi Zone, Ethiopia. An ensemble feature-ranking approach based on Random Forest machine learning was first applied to identify the most relevant predictive features. Subsequently, multiple ensemble machine learning models were evaluated, including Histogram-based Gradient Boosting (Model I), Extreme Gradient Boosting (Model II), AdaBoost (Model III), Random Forest (Model IV), and k-Nearest Neighbors (Model V). These models were tested under multiple experimental settings using both full and reduced feature subsets. To enhance robustness and minimize variability, a multi-seed ensemble framework was employed. Different seed values generate distinct train-test splits and slight variations in model initialization and optimization, leading to minor differences in training outcomes; aggregating results across multiple seeds mitigates this variability and provides more stable and reliable performance estimates. The experimental results demonstrate that boosting-based ensemble models consistently outperform other classifiers across all evaluation metrics. Model I achieved the highest overall performance, reaching an accuracy of 0.983, with precision of 0.982, recall of 0.980, and F1-score of 0.981 using the reduced feature set, while maintaining nearly identical performance with the full feature set. 
Model II also showed stable and strong predictive capability, achieving accuracies of 0.963 and 0.961 for the full and reduced feature sets, respectively, with balanced precision, recall, and F1-score values. These findings indicate that feature importance-based dimensionality reduction preserves essential predictive information without compromising classification performance. Overall, the results highlight the significant predictive contribution of dietary and environmental risk factors and demonstrate that ensemble learning provides a reliable, efficient, and clinically meaningful approach for early EC detection. The proposed framework offers a promising direction for supporting diagnostic decision-making and risk stratification in resource-limited healthcare settings.

Highlights
- Machine Learning Framework for Esophageal Cancer Classification: A robust ensemble machine learning framework was developed to classify esophageal cancer using socio-demographic, dietary, behavioral, environmental, and clinical risk factors, enabling accurate and reliable disease prediction.
- Multi-Seed Ensemble Strategy for Improved Model Stability: A novel multi-seed ensemble classification approach was implemented to reduce model variance and improve robustness by aggregating predictions across multiple randomized training and testing splits.
- Ensemble Feature Ranking for Optimal Feature Selection: An ensemble Random Forest-based feature ranking framework was designed to identify the most predictive features, ensuring stable biomarker selection and improved model interpretability.
- High Classification Performance with Reduced Feature Set: The proposed ensemble HGBC model achieved outstanding performance with 98.3% accuracy, 98.2% precision, 98.0% recall, and 98.1% F1-score using a reduced feature subset, demonstrating efficient dimensionality reduction without performance loss.
- Exceptional Discriminative Ability with Near-Perfect AUC: The ensemble HGBC model achieved an AUC of 0.994, indicating excellent discrimination between cancer and non-cancer cases and confirming its suitability for high-precision clinical decision support.
- Zero False-Negative Predictions and Maximum Diagnostic Sensitivity: The proposed model achieved zero false negatives in evaluation, resulting in 100% statistical power and perfect sensitivity, ensuring reliable detection of esophageal cancer cases.
- Identification of Key Dietary and Environmental Risk Factors: Feature importance analysis revealed that dietary habits, hot food consumption, environmental exposures, and behavioral factors are among the most significant predictors of esophageal cancer risk.
- Ensemble Learning Outperforms Traditional Machine Learning Models: Boosting-based ensemble models, particularly HGBC and XGBoost, consistently outperformed other classifiers, demonstrating superior predictive accuracy, stability, and robustness.
- Efficient and Interpretable AI Framework for Clinical Decision Support: The proposed framework balances high predictive accuracy with interpretability, making it suitable for assisting clinicians in early diagnosis and risk stratification of esophageal cancer.
- AI-Driven Solution for Resource-Constrained Healthcare Settings: The proposed ensemble machine learning approach provides an effective and scalable diagnostic support tool, particularly valuable for healthcare systems with limited resources and access to specialized medical expertise.
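
For readers checking how the reported metrics relate, the sketch below computes F1 as the harmonic mean of precision and recall and applies it to the figures quoted for Model I. The function name is hypothetical, and the calculation assumes the reported precision and recall refer to a single class; if the authors averaged the metrics across classes or seeds, the reported F1 need not equal this harmonic mean exactly.

```python
def f1_from_precision_recall(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract for Model I (reduced feature set).
print(round(f1_from_precision_recall(0.982, 0.980), 3))  # 0.981
```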

19

Machine Intelligence-Driven Forecasting for ED Triage and Dynamic Hospital Patient Routing

Dharmavaram, S.; Bhanushali, P.

2026-02-20 emergency medicine 10.64898/2026.02.18.26346566 medRxiv

Top 0.2%

9.9%

Emergency department (ED) overcrowding is now a global healthcare concern due to rising patient numbers. Triage systems have been established for a considerable period. However, their reliability in identifying the appropriate patient and level of service has come under much scrutiny. In this paper, we describe a comprehensive machine learning framework aimed at predicting critical emergency department outcomes and enabling dynamic routing decisions. Using the MIMIC-IV-ED database, which comprises more than 440,000 emergency visits, we design and assess a range of predictive models, including classical clinical scores, interpretable ML systems, classical algorithms, and deep learning architectures. We investigate three significant outcomes: hospitalization after the ED visit, critical deterioration (ICU transfer or death within 12 hours), and 72-hour ED re-attendance. The results indicate that gradient boosting algorithms predict better, with AUROCs of 0.820, 0.881, and 0.699, than standard clinical scoring systems and complex deep learning models. The interpretable AutoScore framework combines clinical performance with clinical transparency. We also study patterns of feature importance across prediction tasks and discuss how these models can be implemented in real-time clinical workflows. This study builds a reproducible benchmarking platform for ED prediction research. In addition, it presents evidence-based recommendations for intelligent patient routing systems that can help enhance emergency care efficiency and resource utilization while improving patient outcomes in a high-pressure environment.

20

ECG spectrogram-based deep learning model to predict deterioration of patients with early sepsis at the emergency department: a study from the Acutelines data- and biobank

van Wijk, R. J.; Schoonhoven, A. D.; de Vree, L.; Ter Horst, S.; Gaidhane, C.; Alcaraz, J. M. L.; Strodthoff, N.; ter Maaten, J. C.; Bouma, H. R.; Li, J.

2026-03-27 emergency medicine 10.64898/2026.03.26.26349371 medRxiv

Top 0.3%

9.3%

Purpose: Early recognition of deterioration in patients with suspected infection at the emergency department (ED) is important. Current clinical scoring systems show limited discriminative performance for early deterioration. Continuous electrocardiogram (ECG) recordings may offer additional dynamic physiological information that can enhance early prediction of deterioration in patients with suspected infection. Methods: We developed a multimodal, ECG-derived spectrogram-based pipeline to predict deterioration within 48 hours of ED admission. We used the first 20 minutes of ECG recordings for the spectrograms. We compared the model with the National Early Warning Score (NEWS), quick Sequential Organ Failure Assessment (qSOFA), a baseline model with vital parameters, sex, and age, and a Heart Rate Variability (HRV) derived model. Results: In this study, 1321 patients were included, of whom 159 (12%) deteriorated. The multimodal model combining baseline data with spectrograms showed the best overall performance, with an Area Under the Receiver Operating Characteristic (AUROC) of 0.788, followed by the baseline model (age, sex, triage vitals) alone, with an AUROC of 0.730. The HRV-only model and the qSOFA showed the lowest performance (AUROC 0.585 and 0.693, respectively). Conclusion: This study shows that ECG-derived multimodal spectrogram models outperform those based solely on vital signs and HRV features, as well as established clinical scores such as NEWS and qSOFA. Spectrogram analysis represents a promising approach to enhance early risk stratification and support clinical decision-making for patients with suspicion of infection in the ED.
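
The AUROC values compared in the abstract above can be read as the probability that a randomly chosen deteriorating patient receives a higher risk score than a randomly chosen non-deteriorating patient. A minimal sketch of that pairwise definition follows; the function name and the toy labels and scores are illustrative assumptions, not the study's code or data.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the event (e.g. deterioration), 0 otherwise.
    scores: model risk scores, higher meaning higher predicted risk.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive-negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUROC of 0.5 corresponds to chance-level ranking, which gives context for the reported range from 0.585 (HRV-only) to 0.788 (multimodal model).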